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Abstract 

This paper considers the problem of matrix completion when the observed entries are noisy 
and contain outliers. It begins with introducing a new optimization criterion for which the 
recovered matrix is defined as its solution. This criterion uses the celebrated Huber function 
from the robust statistics literature to downweigh the effects of outliers. A practical algorithm 
is developed to solve the optimization involved. This algorithm is fast, straightforward to 
implement, and monotonic convergent. Furthermore, the proposed methodology is theoretically 
shown to be stable in a well defined sense. Its promising empirical performance is demonstrated 
via a sequence of simulation experiments, including image inpainting. 
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1 Introduction 

The goal of matrix completion is to impute those missing entries of a large matrix based on the 
knowledge of its relatively few observed entries. It has many practical applications, ranging from 
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collaborative filtering (Rennie and Srebro, 2005) to computer visions (Weinberger and Saul, 2006) 


to positioning (Montanari and Oh, 2010). In addition, its application to recommender systems 
is perhaps the most well known example, widely made popularized by the so-called Netflix prize 


problem (Bennett and Lanning 2007). In this problem a large matrix of movie ratings is partially 
observed. Each row of this matrix consists of ratings from a particular customer while each column 
records the ratings on a particular movie. In the Netflix dataset, there are around 5 x 10 5 customers 
and 2 x 10 4 movies, with less than 1% of the ratings are observed. Without any prior knowledge, a 
reasonable full recovery of the matrix is virtually impossible. To overcome this issue, it is common 
to assume that the matrix is of low rank, reflecting the belief that the users’ ratings are based on a 
relatively small number of factors. This low rank assumption is very sensible in many applications, 


although the resulting optimizations are combinatorially hard (Srebro and Jaakkola, 2003). To this 
end, various convex relaxations and related optimization algorithms have been proposed to provide 


computationally feasible solutions; see, e.g., Candes and Recht (2009); Candes and Plan (2010); 


Keshavan et al. 

(2010a|b 

); Mazumder et al. 

(2010) 

Marjanovic and Solo 

(2012) and Hastie et al. 


(2014). 

In addition to computational advances, the theoretical properties of matrix completion using 
nuclear norm minimization have also been well studied. For example, when the observed entries 


are noiseless, Candes and Recht (2009) show that perfect recovery of a low rank matrix is possible; 


see also Keshavan et al. (2010a), Gross (2011) and Recht (2011). This result of Candes and Recht 


(2009) has been extended to noisy measurements by Candes and Plan (2010): with high probability, 
the recovery is subject to an error bound proportional to the noise level. Techniques that achieve 


this desirable property are often referred as stable. See also Keshavan et al. (2010b) and Koltchinskii 


et al. (2011) for other theoretical developments of matrix completion from noisy measurements. 


The original formulation of matrix completion assumes those observed entries are noiseless, and 
is later extended to the more realistic situation where the entries are observed with noise. This paper 
further extend the formulation to simultaneously allow for both noisy entries and outliers. To the 
authors knowledge, such an extension has not been considered before, although similar work exists. 


In Candes et al. (2011) a method called principal component pursuit (PCP) is developed to recover 
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a matrix observed with mostly noiseless entries and otherwise a small amount of outliers. This is 


done by modeling the observed matrix as a sum of a low rank matrix and a sparse matrix. Zhou 


et al. (2010) extend this PCP method to noisy entries but assumes the matrix is fully observed, thus 


it does not fall into the class of matrix completion problems. Lastly jChen et al.| (2011) extend PCP 
to safeguard against special outlying structures, namely outlying columns. However, it works only 
on outliers and otherwise noiseless entries. Due to the similarity between the matrix completion and 


principal component analysis, it is worthmentioning that there are some related work (Karhunen 


2011. Luttinen et al. 2012) on robust principal component analysis with missing values. 


The primary contribution of this paper is the development of a new robust matrix completion 
method that can be applied to recover a matrix with missing, noisy and/or outlying entries. This 


method is shown to be stable in the sense of Candes and Plan (2010), as discussed above. As opposed 
to the above referenced PCP approach that decomposes the matrix into a sum of a low rank and 
a sparse matrix, the new approach is motivated by the statistical literature of robust estimation 
which modifies the least squares criterion to downweigh the effects of outliers. Particularly, we make 
use of the Huber function for this modification. We provide a theoretical result that establishes 
an intrinsic link between the two different approaches. To cope with the nonlinearity introduced 
by the Huber function, we propose a fast, simple, and easy-to-implement algorithm to perform the 
resulting nonlinear optimization problem. This algorithm is motivated by the ES-Algorithm for 
robust nonparametric smoothing (Oh et al., 2007). As to be shown below, it can transform a rich 
class of (non-robust) matrix completion algorithms into algorithms for robust matrix completion. 

The rest of this paper is organized as follows. Section [2] provides further background of matrix 
completion and proposes a new optimization criterion for robust matrix recovery. Fast algorithms 
are developed in Section [3] for practically computing the robust matrix estimate. Theoretical and 
empirical properties of the proposed methodology are studied in Section[4]and Section[5]respectively. 
Concluding remarks are given in Section [6j while technical details are relegated to the appendix. 
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2 Matrix Completion with Noisy Observations and Outliers 

Suppose X is an m x n 2 matrix which is observed for only a subset of entries P 0 bs C [m] x [ 712 ], 
where [n] denotes {1,... ,n}. Let 0^ bs be the complement of P obs . Define the projection operator 
Vn ohs as Vii oba B = C, where C tJ = Bij if ( i,j ) G 0 obs and 6^- = 0 if (i, j) 0 0 obs , for any m x n 2 
matrix B = (-Bjj)ie[ni] je[n 2 ]- The following is a standard formulation for matrix completion using 
a low rank assumption: 


minimize rank(T) 

Y 

subject to ^\\V^ obs X - Vn ob Y\\ 2 F < e, 

where e > 0 and || ■ ||f is the Frobenius norm. Carrying out this rank minimization enables a good 
recovery of any low rank matrix with missing entries. Note that for the reason of accommodating 
noisy measurements, the constraint above allows for a slight discrepancy between the recovered and 
the observed matrices. 

However, this minimization is combinatorially hard (e.g., 
fast computation, the following convex relaxation is often used: 

minimize II TIL 

Y 

subject to ^\\Vn ohs X - Vn oh Y f F < e, 

where ||T||* represents the nuclear norm of Y (i.e., the sum of singular values of Y). The Lagrangian 
form of this optimization is 

minimize f(Y\X) = ^\\Vn ohs X - T’a oba T||^ + 7 ||T||*, (1) 

where 7 > 0 has a one-to-one correspondence to e. The squared loss in the first term is used to 
measure the fitness of the recovered matrix to the observed matrix, ft is widely known that such 
a squared loss is very sensitive to outliers and often leads to unsatisfactory recovery results if such 


Srebro and Jaakkola, 2003). To achieve 
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outliers exist. Motivated by the literature of robust statistics (e.g., Huber and Ronchetti, 2011), 
we propose replacing this squared loss by the Huber loss function 


Pc(x) 


x 2 , \x\ < c 

< 

c( 2\x\ — c), \x\ > c 


with tuning parameter c. When comparing with the squared loss, the Huber loss downweighs the 
effects of extreme measurements. Our proposed solution for robust matrix completion is given by 
the following minimization: 


minimize g(Y) = ^ ^ Pci^ij ~ Yij) + 7||R||*. (2) 

(i,j)ef2 obs 

Note that the convexity of p c guarantees the convexity of the objective criterion ([ 2 ]). 

For many robust statistical estimation problems the tuning parameter c is pre-set as c = 1.345<r 
to achieve a 95% statistical efficiency, where a is an estimate of the standard deviation of the noise. 
For the current problem, however, the choice of c is suggested by Theorem [ 2 ] below: c = 'y/y/n^p, 
where n( 1 ) = max{ni,n 2 } and p is the percentage of missing entries. This choice of c was used 
throughout all our numerical work. 


3 Fast Algorithms for Minimization of 


Since the gradient of the Huber function is non-linear, ([ 2 ]) is a harder optimization problem when 
comparing to many typical matrix completion formulations such as 0- As an example, consider 0 
when X is fully observed; i.e., H 0 b s = [n\] x [ 712 ]. Through sub-gradient analysis (e.g., Cai e t al.[ 
Ma et al. 2011), one can derive a closed-form solution to ([!]), denoted as S 1 {X) 1 where S 1 is 


2010 


the soft-thresholding operator defined in Mazumder et al. (2010), also given in ([6]) below. However, 
even if X was fully observed, ([ 2 ]) does not have a closed-form solution. The goal of this section is 
to develop fast methods for minimizing Q. 
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3.1 A General Algorithm 


In Oh et al. (2007) a method based on the so-called theoretical construct pseudo data is proposed 


for robust wavelet regression. The idea is to transform a Huber-type minimization problem into a 
sequence of fast and well understood squared loss minimization problems. This subsection modifies 
this idea and proposes an algorithm to minimizing ([ 2 ]). 


As similar to Oh et al. (2007), we define a pseudo data matrix as 


Z = Vn oh Y + -ME), 


(3) 


where Y is the current estimate of the target matrix, E = Vn ohs X — Vn oh Y is the “residual 
matrix”, and V’c = p' c is the derivative of p c . With a slight notation abuse, when is applied to 
a matrix, it means i)j c is evaluated in an element-wise fashion. Straightforward algebra shows that 
the sub-gradient of f(Y\Z) (with respect to F) evaluated at Y. 


- ('Pn ohs Z - Vn ohs Y) + 7d||y||*, (4) 

is equivalent to the sub-gradient of g(Y) (with respect to Y) evaluated at Y, 

~ \MVn ob X-Vn oh Y) + 1 d\\Y\U- (5) 

The proposed algorithm iteratively updates Y = arg rniiiy f(Y\Z) and Z using Upon conver¬ 
gence (implied by Proposition [l] below), the sub-gradient (jdj) contains 0 at the converged Y and 
thus the sub-gradient ^ also contains 0 at this converged Y. Therefore this Y is the solution 
to Q. Details of this algorithm based on pseudo data matrix are given in Algorithm [lj 

Algorithm [l] has several attractive properties. First, it can be paired with any existing (non- 
robust) matrix completion algorithm (or software), as can be easily seen in Step 2(c). This is 
a huge advantage, as a rich body of existing (non-robust) methods can be made robust against 
outliers. Second, once such an (non-robust) algorithm is available, the rest of the implementation 
is straightforward and simple, and no expensive matrix operations are required. Lastly, it has 
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Algorithm 1 The General Robust Algorithm 

1 : Perform (non-robust) matrix completion on X and assign Y old •<— arg miriy f(Y\X). This Y old 
is the initial estimate (starting point of the algorithm). 

2 : Repeat: 

(a) Compute E <- Vn oh X - Vn ob Y° u . 

(b) Compute Z <- Vn oh Y old + l^ c (E). 

(c) Perform (non-robust) matrix completion on Z and assign Y new •(— arg rriiriy f(Y\Z). 

(d) If 

||y^new _^old||2 

l|r old ||J F<e ’ 

exit. 

(e) Assign Y old «— Y new . 

3: Output Y new . 


strong theoretical backup, as to be reported in Section [4j 


3.2 Further Integration with Existing Matrix Completion Algorithms 

Many existing matrix completion algorithms are iterative. A direct application of Algorithm [l] 
would lead to an algorithm that is iterations-within-iterations. Although our extensive numerical 
experience suggests that these direct implementations would typically converge within a few it¬ 
erations to give a reasonably fast execution time, it would still be advantageous to speed up the 
overall procedure. Here we show that it is possible to further improve the speed of the overall 
robust algorithm by embedding the pseudo data matrix idea directly into a non-robust algorithm. 


We shall illustrate this with the Soft-Impute algorithm proposed by Mazumder et al. (2010). 
To proceed we first recall the definition of their thresholding operator S~: for any matrix Z of rank 


5 7 (Z) = UD^V\ 


( 6 ) 


where Z = UDV 1 is the singular value decomposition of Z. D = diag[di,..., d r \ and D 1 = 
diag[(di — 7 ) + ,..., (d r — 7 )+]. Now the main idea is to suitably replace an iterative matrix es¬ 
timate with the pseudo data matrix estimate given by ([3]). With Soft-Impute, the resulting 
robust algorithm is given in Algorithm [2j We shall call this algorithm Robust-Impute. As to be 
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shown by the numerical studies below, Robust-Impute is very fast and produces very promising 
empirical results. Our algorithm also has the sparse-plus-low-rank structure in the singular value 
thresholding step (Step 2a(iii)). This linear algebra structure has positive impact on the computa¬ 
tional complexity. See Section 5 of Mazumder et ah] (2010) for details. Moreover, the monotonicity 
and convergence of our algorithm is guaranteed by Proposition [l] and Theorem [lj 


Algorithm 2 Robust-Impute 
1: Initialize R old = S 71 (Vn ohs X) and Z = X. 

2: Do for 7i > 72 > ■ ■ ■ > 7 a-: 

(a) Repeat: 

(i) Compute E <- Vn ohs X - Vn oh Y° ld . 

(ii) Compute Z <- Va oh Y old + \^ C {E) 

(iii) Compute T new <- S 7k (Vn obs Z + V^ T old ). 

(iv) If 

||^new _ yold||2 

r old iil 

exit. 

(v) Assign y old e- T new . 

(b) Assign Y lk y new . 

3: Output the sequence of solutions Y 7I ,, Y 7K . 


4 Theoretical Properties 

This section presents some theoretical backups for the proposed methodology. 


4.1 Monotonicity and global convergence 

We first present the following proposition concerning the monotonicity of the algorithms. The proof 


can be found in Appendix A.l 


Proposition 1 (Monotonicity). Let Y^ and = Vci oh Y^ k ^ + tl) c {Vo, ohs X — Va oh Y^ k 1 ' ) )/2 
be, respectively, the estimate and the pseudo data matrix in the k-th iteration. //y( fc +P is the next 
estimate such that /(y(fc+i)| Z (fc+ 1 )) < /(y(D \ Z ( k + 1 )) ; then g(Y^ k+1 ^) < g(Y^). 












For the general version (Algorithm [TJ, it is obvious that the condition f(y^ k+1 )\Z^ k+l ^) < 
/(yi fc )|Z (fc+1 )) is satisfied as the result of the minimization y old •(— arg rriiriy f(Y\Z). For the spe¬ 
cialized version Robust-Impute (Algorithm [2]), this condition is implied by Lemma 2 of Mazumder 


et al. (2010). Therefore both versions are monotonic. 


As pointed out by a referee, the proposed algorithms can also be viewed as an instance of the 


majorization-minimization (MM) algorithm (Lange et al. 2000 Hunter and Lange, 2004). It can 
be shown that, for (i, j) E fl 0 b s , 


1 


Pc{Xij - Yij) < Pc(Xij - Y,T) - (Y.J - YS“»e(Xij - Y,T) + 2 ' - r ( f ? 


Yij - Kf - - Y,f) 


+ constant 


= (Yij — Zij ) 2 + constant 

Therefore, subject to an additive constant that does not depend on Y, h(y|y old ) = f(Y\Z) = 
(1/2) j)en ohs (Zij — Y^) 2 + 7||y||* is a majorization of the objective function g. With this rna- 
jorization, Algorithm [l] can be viewed as an MM algorithm. Additionally, one can majorize the 
unobserved entries by (Y l3 — Z^) 2 = (Y\j — Y.^ d ) 2 > 0 and, together with the above majorization of 
the observed entries, Algorithm [2] can also be shown as an MM algorithm. Therefore the monotonic¬ 
ity of the proposed algorithms can also be obtained by the general theory of MM algorithm (e.g., 


Lange 2010). Moreover, the explicit connection to the MM algorithm allows possible extensions of 


the current algorithm to other robust loss functions such as Tukey’s biweight loss. However, due 
to non-differentiability of the objective function, the typical convergence analysis of MM algorithm 


(e.g., Lange, 2010, Ch. 15) does not apply to our case. 


We summarize the global convergence rates of both Algorithm[l]and Algorithm [2] in the following 
theorem. 


Theorem 1 . Let Y^ and Y^ be, respectively, the estimate in the k-th iteration and the starting 


9 





















point of Algorithm [I] or Algorithm Then for any k > 1, 

ll"Po V(°) — Vn Y* II 2 

^(yW) - g{Y*) < - n ° hs -vy* G y, 

ZiKt 

|| v(o) _ y* II 2 

Algorithm || ff(y (fc) ) - g(Y*) < U-—-Vy* G J>, 

z/c 

where y be the set of all global minimizers of g (i.e. y = arg miriygg™, x« 2 g(Y)). 

The global convergence analysis of Algorithm [l] can be carried out similarly as in Beck and 
Teboulle (2009) for proximal gradient method, despite that Algorithm [l] is not a proximal gradient 
method. For completeness, we give the proof of Theorem [l] for Algorithm [l] in Appendix A.2 


As for Robust-Impute (Algorithm [2]), we can rewrite it as an instance of the proximal gradient 
method applied to g(Y) = gi(Yi)+g 2 (Y 2 ), where gi{Y) = (1/2) S(ij) e n obs PciXij-Yij) and g 2 (Y) = 
7||y||*. In our case, the proximal gradient method with step size L iterates over y( fc+1 ) = Ql(Y^) 
where 


£l(Y) = arg min j g 2 (Y) + ^ 


y - (y - \ygi{Y) 


where L is constant greater than or equal to the Lipstchiz constant of g\ . Note that g\ has a 
Lipschitz contant 1. If we take L = 1, we have the following simplification. 


9200 + ^ 


Y - ( y - \ygi{Y) 


= 92(Y)+ l - 
= 92(Y) + x - 


y ~ {y +lMVn oh X -Vn oh Y) 


Y-\V oA Y + Va h z\ 


The minimization of is equivalent to Step 2a(iii) of Algorithm^ Therefore, the proximal gradient 
method is the same as Robust-Impute. This connection allows us to apply the convergence results 
of proximal gradient method to Robust-Impute directly. Theorem [l] for Algorithm [2] follows from 


Theorem 3.1 of Beck and Teboulle (2009). 


4.2 Stable Recovery 


Recall the stable property of Candes and Plan (2010) implies that, with high probability, the 


recovered matrix is subject to an error bound proportional to the noise level. This subsection 
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shows that the robust matrix completion defined by © is also stable. 

Although the formulation of ([ 2 ]) has its root from classical robust statistics, it is also related to 


the more recent principal component pursuit (PCP) proposed by Candes et al. (2011). PCP assumes 
that the entries of the observed matrix are noiseless, and that this matrix can be decomposed as the 
sum of a low rank matrix and a sparse matrix, where the sparse matrix is treated as the gross error. 


In Candes et al. (2011) it is shown that using PCP perfect recovery is possible with or without 


missing entries in the observed matrix. For the case of noisy measurements without missing entries, 


Zhou et al. (2010) extend PCP to stable PCP (SPCP), which is shown to be stable. However, to 


the best of our knowledge, there is no existing theoretical results for the case of noisy (and/or 
outlying) measurements with missing entries. 


Inspired by She and Owen (2011), we first establish an useful link between robust matrix 


completion ([2]) and PCP in the following proposition. The proof can be found in Appendix A.3 


Proposition 2 (Equivalence). The minimization M) is equivalent to 


minimize ~\\V^ ohs X - Vn ohs ( L + S)\\p + yll-^ll* + c||S||i. 


L,S 


(7) 


That is, the minimizing Y of Q) and the minimizing L of ^ coincide. 

Minimization ([7]) has a high degree of similarity to both PCP and SPCP. It is equivalent to 


minimize ||L||* + A||S||i ( 8 ) 

L,S 

subject to \\Vn oba X - Vn ohs ( L + S) f F < 6 2 , 

where A = c /7 and 6 > 0 has a one-to-one correspondence to 7 . When comparing with PCP, Q 
permits the observed matrix to be different from the recovered matrix ( L + S) to allow for noisy 
measurements. When comparing with SPCP, ([7]) permits missing entries, which is necessary for 
matrix completion problems. 

Proposition [2] has two immediate implications. First, the proposed Algorithm [l] provides a 
general methodology to turn a large and well-developed class of matrix completion algorithms into 
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algorithms for solving SPCP with missing entries. Second, many useful results from PCP can be 


borrowed to study the theoretical properties of robust matrix completion ([2]). In particular, we 
show that ({2]) leads to stable recovery. With Proposition [2j it suffices to show that ([?]) achieves 
stable recovery of (. L 0 ,S' 0 ) from the data Vn oha (X) generated by Vn ohs (L 0 + S 0 ) obeying \\Vn ohB X - 
Rn ohs { L o + S 0 )\\f < $ and S' 0 = Vn obB S 0 . Note that L 0 = X 0 . 

We need some notations to proceed. For simplicity, we assume n = n\ = 712 but our results 
can be easily extended to rectangular matrices ( n\ A 712). The Euclidean inner product (Q,R) is 
defined as trace(Q T i?). Let po be the proportion of observed entries. Write T C 0 o b s as the set of 
locations where the measurements are noisy (but not outliters), and = fl 0 b s \r as the support 
of Sq = Vn oha So] i.e., locations of outliers. Denote their complements as, respectively, and 
We define Vr, Vci, and V^a similarly to the definition of Vn ohs - Let r be the rank of Lq and 
UDV T be the corresponding singular value decomposition of Lq, where U, V £ M rixr and D £ R rxr . 


Similar to Candes et al. (2011), we consider the linear space of matrices 


T := {UQ J + RV J : Q,R e 


'}■ 


Write Vt and V T ± as the projection operator to T and T 1 - respectively. As in Zhou et al. (2010), we 


define a set of notations for any pair of matrices M = ( L,S ). Here, let ||M||i? := y/||L|||, + jjsjj|i 
and ||M||(> := ||L||* + A||S||i. We also define the projection operators Vt x V r ± : ( L,S ) i->- 
(VtL,Vp±S) and V T ± x Vr ■ (L, S) ^ (V T ^-L,VrS). In our theoretical development, we consider 
the following special subspaces 


* := {(4 S):L,Se R nxn , Vn ob L = Vn obs S, V n a L = V n a S = 0} 

obs obs 

^ ■■= {(L, S):L,Se R nxn , Vn oh L + Vn obs S = 0}. 

And we write the corresponding projection operators as Vy and V^,± respectively. Let Mq = 
(Lq, S'0). Lastly, for any linear operator A, the operator norm, denoted by ||A||, is sup||jg|| F=1 i. ||AQ||_f. 
In below, we write that an event occurs with high probability if it holds with probability at least 
1 - 0(n~ 10 ). 
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To avoid certain pathological cases (see, e.g., Candes and Recht, 2009), an incoherence condition 
on U and V is usually assumed. To be specific, this condition with the parameter fi is: 


max ||C/ T ej|| 2 

i 


fir 

— 1 
ni 


max ||R T ei|| 2 < -—, and ||C/R T || 00 < 

i n 2 


fir 

nin 2 ' 


(9) 


where ||Q|| is the operator norm or 2-norm of matrix Q (i.e., the largest singular value of Q) 
and IIQHoo = maxjj \Qi.j\- This condition guarantees that, for small fi, the singular vectors are 
reasonably spread out. 

Theorem 2 (Stable Recovery). Suppose that Lq obeys and f2 0 b s is uniformly distributed among 
all sets of cardinality m = p$n 2 with po > 0 being the proportion of observed entries. Further 
suppose that each observed entry is grossly corrupted to be an outlier with probability r independently 
of the others. Suppose Lq and So satisfy r < /j r n/r _1 (logn)~ 2 and r < t s with p r ,T s being positive 
numerical constants. Choose A = 1 / y/npo ■ Then, with high probability (over the choices of Cl and 
Clobs), for any X obeying \\Vn oba X - Vn oha (L 0 + S 0 )\\f < 5, the solution (L, S) to 0 satisfies 

2 + 8 y/n ^1 + | ^ an d ||-§ — S' 0 \\f < ^2 + 8 y/n ^1 + | \/ n Po'$■ 


I L - L 


0||F < 


where S' Q = Vn oi JS 0 ). 


The proof of this theorem can be found in Appendix A.4 


5 Empirical Performances 


Two sets of numerical experiments and a real data application were conducted to evaluate the 
practical performances of the proposed methodology. In particular the performance of the pro¬ 
posed procedure Robust-Impute is compared to the performance of Soft-Impute developed by 


Mazumder et al. (2010). The reasons Soft-Impute is selected for comparison are that it is one 


of the most popular matrix completion methods due to its simplicity and scalability, and that it 


is shown by Mazumder et al. (2010) that it generally produces superior results to other common 


matrix completion methods such as MMMF of Rennie and Srebro (2005), SVT of Cai et al. (2010) 
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and OptSpace of Keshavan et al. (2010a) 


5.1 Experiment 1: Gaussian Entries 


This experiment covers those settings used in Mazumder et al. (2010, Section 9) and additional 


settings with different proportions of missing entries and outliers. For each simulated data set, the 
target matrix was generated as Xq = UV J , where U and V are random matrices of size 100 x r 
with independent standard normal Gaussian entries. Then each entry of Xo is contaminated by 
additional independent Gaussian noise with standard deviation a, which is set to a value such that 
the signal-to-noise ratio (SNR) is 1. Here SNR is defined as 


SNR = 8 = 


Var(Xo) 


a“ 


where Var(Xo) is the variance over all the entries of Xo conditional on U and V. Next, for each 
entry, with probability p yet another independent Gaussian noise with a/4 is added; these entries 
are treated as outliers. We call this contaminated version of Xq as X. Lastly, fi 0 b s is uniformly 
random over the indices of the matrix with missing proportion as q. In this study, we used two 
values for r (5, 10), three values for p (0, 0.05, 0.1) and three values for q (0.25, 0.5, 0.75). Thus 
in total we have 18 simulation settings. For each setting 200 simulated data sets were generated, 
and both the non-robust method Soft-Impute and the proposed Robust-Impute were applied 
to recover Xq. We also provide two oracle fittings as references. They are produced by applying 
Soft-Impute to the simulated data set with outlying observed entries removed (i.e., treated as 
missing entries), and with outlying observed entries replaced by non-outlying contaminated entries 
(i.e., contaminated by independent Gaussian noise with standard deivation a) respectively. The 
first oracle fitting is referred to as oraclel while the second one is called oracle2 in the following. 

For the two simulation settings with r = 10 and q = 0.5, and one with p = 0 while the other 
with p = 0.1, Figure [I] summarizes the average number of singular value decompositions (SVDs) 
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used and the average test error. Here test error is defined as 


Test error = 


\\V n ± (X 0 -X)\\% 

obs 


where Vr is the projection operator to the set of locations of the observed noisy entries (but not 
outliers) T, and X is an estimate of Xq. From Figure [l] (Top), one can see that the performance 
of Robust-Impute is slightly inferior to Soft-Impute in the case of no outliers (p = 0), while 
Robust-Impute gave significantly better results when outliers were present (p = 0.1). The inferior 
performance of Robust-Impute under the absence of outliers is not surprising, as it is widely 
known in the statistical literature that a small fraction of statistical efficiency would be lost when 
a robust method is applied to a data set without outliers. However, it is also known that the gain 
could be substantial if outliers did present. 

As for computational requirements, one can see from Figure [I] (Bottom) that Robust-Impute 
only used slightly more SVDs on average. For ranks greater than 5, the number of SVDs used by 
Robust-Impute only differs from Soft-Impute on average by less than 1. This suggests that 
Robust-Impute is slightly more computationally demanding than Soft-Impute. 

Similar experimental results were obtained for the remaining 16 simulation settings. For brevity, 
the corresponding results are omitted here but can be found in the attached supplementary docu¬ 
ment. 

From this experiment some empirical conclusions can be drawn. When there is no outlier, 
Soft-Impute gives slightly better results, while with outliers, results from Robust-Impute are 
substantially better. Since that in practice one often does not know if outliers are present or not, 
and that Robust-Impute is not much more computationally demanding than Soft-Impute, it 
seems that Robust-Impute is the choice of method if one wants to be more conservative. 


5.2 Experiment 2: Image Inpainting 

In this experiment the target matrix is the so-called Lena image that has been used by many authors 
in the image processing literature. It consists of 256 x 256 pixels and is shown in Figure [2] (Left). 


15 



0 . 9 - 



-®- nonrobust 
-a- robust 
-a- oracle 1 
-I- oracle2 


10 20 30 40 50 

rank 


1.0 


o 

a> 


0.8 


0.6 



10 20 30 40 50 


rank 


nonrobust 
-a- robust 
-B- oraclel 
-I- oracle2 



Figure 1: Top: The average test errors with their standard error bands (plus or minus one standard 
error). Bottom: The average number of singular value decompositions used with standard error 
bands (plus or minus one standard error). Left: results for the simulation setting: r = 10, p = 0 
and q = 0.5. Right: results for the simulation setting: r = 10, p = 0.1 and q = 0.5. 
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The simulated data sets were generated via contaminating this Lena image by adding Gaussian 
noises and/or outliers in the following manner. First independent Gaussian noise was added to 
each pixel, where the standard deviation of the noise was set such that the SNR is 3. Next, 10% 
of the pixels were selected as outliers, and to them additional independent Gaussian noises with 
SNR 3/4 were added. In terms of selecting missing pixels, two mechanisms were considered. In the 
first one 40% of the pixels were randomly chosen as missing pixels, while in the second mechanism 
only 10% were missing but they were clustered together to form patches. Two typical simulated 
data sets are shown in Figure [2] (Middle). Note that Theorem [2] does not cover the second missing 
mechanism. For each missing mechanism, 200 data sets were generated and both Soft-Impute 
and Robust-Impute were applied to reconstruct Lena. 

The average training and testing erron0 of the recovered images of matrix ranks 50, 75, 100 
and 125 are reported in Table [lj For both missing mechanisms, SOFT-lMPUTE tends to have lower 
training errors, but larger testing errors when compared to Robust-Impute. In other words, 
Soft-Impute tends to over-fit the data, and Robust-Impute seems to provide better results. 
Lastly, for visual evaluation, the recovered image of rank 100 using Robust-Impute is displayed 
in Figure [I] (Right). From this one can see that the proposed Robust-Impute provided good 
recoveries under both missing mechanisms. 


Table 1: The average training and testing errors for the Lena experiment. 


rank 

50 

training error 

75 100 

125 

50 

testing error 

75 100 

125 

independent 

missing 

Soft-Impute 

Robust-Impute 

0.0499 

0.0486 

0.0351 

0.0371 

0.0221 

0.0282 

0.0113 

0.0252 

0.0578 

0.0546 

0.0565 

0.0540 

0.0581 

0.0557 

0.0620 

0.0571 

clustered 

missing 

Soft-Impute 

Robust-Impute 

0.0487 

0.0468 

0.0386 

0.0390 

0.0296 

0.0321 

0.0214 

0.0268 

0.0756 

0.0716 

0.0751 

0.0714 

0.0760 

0.0723 

0.0781 

0.0742 


The solution path (formed by the pre-specified set of y’s) may not contain any solution of rank 50, 75, 100 and 
125. Thus, the average errors were computed over those fittings that contained the corresponding fitted ranks. At 
most 2% of these fittings were discarded due to this reason. 
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Figure 2: Left: the Lena image. Middle: degraded Lena images by the independent missing 
mechanism (Top) and the clustered missing mechanism (Down). Right: corresponding recovered 
images of rank 100 via Robust-Impute. 
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5.3 Real data application: Landsat Thematic Mapper 


In this application the target matrix is an image from a Landsat Thematic Mapper data set publicly 
available at http://ternauscover.science.uq.edu.au/. This data set contains 149 multiband 
images of 100 x 100 pixels, with each image consists of six bands (blue, green and red with three 
infrared bands). The scene is centered on the Tumbarumba flux tower on the western slopes of the 
Snowy Mountains in Australia. Due to wild fires or related reasons, some pixels are of value zero 
which can be treated as missing. Also, due to detector malfunctioning, some isolated pixels have 
values much higher than the remaining pixels, which can be treated as outliers. We selected an 
image band with a high missing rate (27.6%) to test our procedure. 

To evaluate the recovered matrix, the observed pixels were split into training, validation and 
testing sets consisting 80%, 10% and 10% of the observed (nonzero) entries respectively. We used the 
validation set to tune 7. The validation errors are computed in two ways: mean squared error (MSE) 
\J^(i j)ev(^b ~ -X-ij) 2 / |V| and mean absolute deviation (MAD) median{|A/j — Xij\ : (i,j) £ V}, 
where V represents the validation set. Similarly, we compute the testing errors in terms of MSE 
and MAD. Note that the validation and testing sets may contain outliers and therefore MAD serves 
as a robust and reliable performance measure. The corresponding results are shown in Table [2j 
From this table it can be seen that with the presence of outliers, ROBUST-lMPUTEprovided better 
results. 


Table 2: Rank and testing errors of the real data application. 



tuning by MSE 
rank MSE MAD 

tuning by MAD 
rank MSE MAD 

Soft-Impute 

Robust-Impute 

24 45.20 31.15 
24 44.63 29.00 

21 45.23 31.15 
29 44.57 28.76 


6 Concluding remarks 


In this paper a classical idea from robust statistics has been brought to the matrix completion 
problem. The result is a new matrix completion method that can handle noisy and outlying entries. 
This method uses the Huber function to downweigh the effects of outliers. A new algorithm is 
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developed to solve the corresponding optimization problem. This algorithm is relatively fast, easy 
to implement and monotonic convergent. It can be paired with any existing (non-robust) matrix 
completion methods to make such methods robust against outliers. We also developed a specialized 
version of this algorithm, called Robust-Impute. Its promising empirical performance has been 
illustrated via numerical experiments. Lastly, we have shown that the proposed method is stable; 
that is, with high probability, the error of recovered matrix is bounded by a constant proportional 
to the noise level. 


A Technical Details 

A.l Proof of Proposition [l] 

Proof. By rewriting 


\\Vn obs Z^ - Vn oh Y {k+1) III = IIAu> fc+1) - rn ob Y {k) \\l + \\Vn ob P (fc) - Pn ob A (fe+1) III 

2 x trace [{ Vn oha Z (fc+1) - Vn oh Y^}{Vn ob - Vn ob Y^y 


and using /(r( fe+1 )|Z( fe+1 )) < f(Y^\Z^ k+ ^), we have 


2 > 


\\Vn oh p (fe) - Vn oh Y (k+1) \\ 2 F + trace [{ Vn ohB Z^ - Vn oh Y (k) }{Vn oh Y W - Pn ob A (fc+1) } T 


+ 7l|y (fe+1) ||* < 7||y (fe) ||*. 


Thus, by substituting Z< fc+1 ) = Vn ob Y^ + \p' c (Vn ob X - Vn oh Y^), 

\\\^ oh Y [k) ~ ^n ob A (fe+1) III + ^ trace [p' c (Vn ob X - V^ b A W ){Pn ob .Y™ ~ Po ob A (fc+1) } T 


+ 7lA (fc+1) ll* <7lA (fc) ll*- 


(10) 
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Here we abuse the notation slightly so that p' c of a matrix simply means the matrix formed by 
applying p' c to its entries. Note that for each (i,j) E H 0 b s , by Taylor expansion, 


„ x ■ —Y < ' k+1 ' > 

Pc(x» - i ' i f +1) ) = p(x>, - if >) + (if* - if Vc(*« - if 1 ) + f " " 

Xij Y"ij 


{ Xij -Yg +1) -t)&{t)dt 


and the last integral term is less than or equal to (Y^ — Y^ k+1 ^) 2 due to p’f < 2 almost everywhere. 


Thus, 


£ pAX lj -Yff +1) )< Y. PAXij-Y^) 

obs (^jj)^^obs 


+ trace [p' c (Pn ob X - Vn oh Y^){Vn oh Y^ - Pn ob H (fe+1) } 1 


Now, plugging it into (10), we 


have c/(T (fc+1) ) < c/(T (fe) ). 


□ 


A.2 Proof of Theorem [l] for Algorithm [l] 


Proof. This proof closely follows the proofs of Lemma 2.3 and Theorem 3.1 in Beck and Teboulle 
(2009) by modifying their approximation model to 


C(T, Y) = gi(Y) + (Y - Y, V 9l (Y)) + ±\\Vn ob Y - Vn oh Y\\ 2 F + g 2 (Y) 

= gi(Y) -\{Y- Y,MVn oh X - Vn ob Y)) + ~\\Vn oha Y - Vn oh Y\\ 2 F + toF) 

= 9 a{Y) - ^{Pn obs Y - Vn oh Y, ip c (Vn ohB X — Vn oh Y)) + ^\\Vn ob Y - Vn ohs Y \\ 2 F + 52(A), 


where (A, Y) = • A ijYij. It can be shown that argminy f(Y. Y) is the same as argminy f(Y\Z), 
where Z = Vn obs Y + (l/2)ip c (Vn obs X - Vn ohs Y), in Steps 2(a)-(c) of Algorithm [l] Let n(T) = 
argminy £(Y, Y). Therefore Y( fc+1 ) = n(fA fc )). Moreover, 

gi(Y) < 51(A) + (Y- Y,V 91 (Y)) + i|| Vn oh Y - V^ oh Y\\l, 
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for any Y and Y. Therefore, <?(Ll(y)) < £(Ll(y),y) for any Y E M niXn2 . 


To proceed, we need a modified version of Lemma 2.3 in Beck and Teboulle (2009). 
Lemma 1. For any Y , Y E M niX " 2 ; 


1 


g(Y) - g(U(Y)) > -||Pn ob n(y) - Vn ob Y ||£ + (Vn ob Y - Vn ob Y,Vn oh U(Y) - Va oh Y) 


This lemma is proved as follows. Since II(y) is the minimizer of the convex function £(-,y), 
there exists a b(Y) E <9g2(hl(y)), the subdifferential of g 2 at II(y), such that g\{Y) + Vn oh Jl{Y) — 
Vn oh Y + b{Y) = 0. By the convexity of gi and g 2 , 

9 i(Y) > gi (Y) -\{Y- YMVn oh X - Vn ob Y)) 

92 {Y) > g 2 (H(Y)) - (Y - U(Y),b(Y)). 

Therefore, 


g(Y) > 91 (Y) — -{Y — Y, M-Pn ob X - Vn oh Y)) + g 2 (U(Y)) - (Y - U(Y), b(Y)). (11) 


Since g{U(Y)) < ({U(Y), Y), we have g(Y) - g(II(y)) > g(Y) - C(n(T), Y). Plugging in ([llj), the 
dehnition of ( and the condition for b, the conclusion of the lemma follows. 

Using Lemma 0 with Y = Y* and Y = Y^ k \ we have 


2 {g(Y*) - g(Y «)} > II Vn oh Y* - Vn oh Y^ f F - \\Vn ob Y* - Vn ob Y^\\] 


Summing it over k = 0,... ,m — 1, 


771—1 


2 mg(Y*) - V g(Y^) > \\Vn oh Y* - Vn oh Y^f F - \\Vn ob Y* - Vn oh Y^f F . (12) 


k=0 


Applying Lemma 0 with Y = Y = 


-5(^ +1) )} > \\Vn ob Y^ -Vn oh Y^\\ 
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Multiplying it by k and summing over k = 0,... ,m — 1, 


m—1 


m—1 


2 - mg(Y M) + £ <?(T (fc+1) ) > E k\\Vn ob Y^ - Vn ob Y^\\ 2 F . 


(13) 


k=0 


k=0 


Adding (12) and (13), 


2 {g(Y*) - g(Y (H)} > || Vn oh Y* - Vn oh Y^\\l - \\Vn ob Y* - Vn ob Y^\\ 2 F 

m—1 

+ E fc H^o b 8 ^ (fe+ 1 ) -^n ob r (fe) || 

k =0 


Therefore, 




||Pn ob5 P*-Po ob P (0) ||| 

2m 


□ 


A.3 Proof of Proposition [2] 

Proof. Since both ([2]) and Q are convex, we only need to consider the sub-gradients. The sub¬ 
gradient conditions for minimizier of ([2]) are given as follows: 

0 G ~\p' c {Vn oh X - Vn oh Y) + 7 5||r|U, (14) 

where <9|| • ||* represents the set of subgradients of the nuclear norm. The sub-gradient conditions 
for minimizier of 0 are given as follows: 

0 G —Po obs (A — L — S) + 7<9||L||* (15) 

Oe-Vn obs (X-L-S) + cd\\S\\ 1 , (16) 
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where 9|| • ||i represents the set of subgradients of |j • ||i. Here (16) implies, for (i, j) E fl 0 bs , 


Sij — < 


F ij Lij c, Xjj Lij c 


0, 


| X- ij Lij | ^ C 


(17) 


Xij Lij T C, Xij Lij < C 


and = 0 for (i,j) E H^ bs . Note, for (i,j) E O obs , X tJ - L tj - = p' c (Xij - L^)/ 2. Plugging it 

into (15), we have (14) and thus this proves the proposition. □ 


A.4 Proof of Theorem I 


To prove Theorem [2j we first show three lemmas and one proposition. 


Lemma 2 (Modified Lemma A.2 in (Candes et ah, 2011)). Assume that for any matrix Q, 
W'Pt'Py'xQWf < 'n\\'P T ±'P r ±Q\\F■ Suppose there is a pair (W,F) obeying 


r T W = 0, ||TT||<l/2, 

<V r ±F = 0, Halloo <1/2, (18) 

UVT + W + V T D = \(sgn(S' 0 ) + F) with \\V T D\\ F < n~ 2 . 

Then for any perturbation H = (H L , H s ) satisfying Vn ohs H L + Vn obs H s = 0, 


\\M 0 ~ H \\o > UMollo + Q - ^) \\V T xH L \U + Q - "4^ 


Candes et al. 


The proof of this lemma can be found in 
\L\\p + A 2 ||5||^ for any pair of matrices M = (L, S ). 


(2011). To proceed, we write ||M||^ A = 


Lemma 3. Let M = (Ml, Ms) be any pair of matrices. Suppose WVli^VtMlW^ > pq\\VtMl\\ 2 f /2 
and \\VtVli\\ 2 < po/%. Then 


II V 9 (V T x Va)M\\% x > {1 + X p P0 \\(V T x Va)M\\ 2 F . 
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Proof of Lemma [3| Note that for any M' = (M' L , M' s ), 


VyM' = 


Tci^M'l + M' s ) Vn ohs (M' L + M> s ) 


Thus 


\\Vq,(V T X Vq)M\\ 2 fx = 


m + i— WP^C^tMl + VnM s )\\ 2 F 


1 + A 2 


(\\Vn obs V T M L \\ 2 F + \\VnMsWl + 2 (Vn obs V T M L ,V„M s )) 


where the last equality is due to Li C fi 0 bs- By ||TyPn|| 2 < Po/8, 


(' Pn ohs 'PTM L ,'PnMs ) = (' P t M l ,VqMs) 

= ( T t M l , ( VtPq)VqMs) 

> —\\VtVq\\ WVtMlWfWVqMsWf 

> ~^^\\PtM l \\f\\VqM s \\f- 

Combining with \\Vn ohs VTM L \\ 2 F > p 0 \\VtM l \\ 2 f /2, we have 

II Vv(Vt x Vn)M\\ 2 FX > — (^\\VtM l \\ 2 f + \\VnMs\\ 2 F - ^J^-\\VTM L \\ F \\VnMs\\F^ 

As 2(x 2 + y 2 — xy ) > x 2 + y 2 for x, y > 0, 

II V*(V T x Vn)M\\% x > (f \\V T M L f F + \\V a M s \\ 2 F ) > (1 + * )Po \\(V T x Vn)M\\ 


Lemma 4. Let M = (Ml, Ms) be any pair of matrices. Then \\P^M\\ 2 FX < ||M|||, A /2. 












Proof of Lemma 0 Write M* = = P^Af. Since ||M*||f, = ||Mf ||f., 


||P^A/||^ a — ||M®|||i + A 2 ||Af i |'|||, 

= \\f + II^sIIf) + y IIf + \\ m !\\ 2 f ) 

= ;y*lluyllM*lll 
< ~ t ~ y ll M ll*’ = qII^IIpa 


□ 


Proposition 3. Assume that for any matrix Q, \\'Pt'Pt 1 -Q\\f < n\\V T ±V r ±Q\\F and ||Pfi obs 'Pr<2||.F > 
Po\\'PtQ\\f/‘ 2- Further suppose 4/n < A < 1, n > 3, po > 0, ||PtPo|| 2 < Po/8 and that there exists 
a pair ( W,F ) obeying (18). Then the solution M = ( L,S) to |?J) satisfies 


\\M-M 0 \\ F ,x< 


Vl + A 2 + 4 1 + J— (v^ + nA^) 
V Po/ 


where M 0 = (L 0 ,S’ 0 ) such that \\Va obs X - Po obB (L 0 + S 0 ))\\ 2 F < 5 and S’ 0 = Pq obs S 0 . Further, if 
A = 1 / y/npo (which implies l/n < po < n/16), we obtain 

|| L — L\\p < |2 + 8 y/n ^1 + | ^ and 11 ^ — S' 0 \\f < ^2 + 8^/n ^1 + ^ | V n Po^- 

Proof of Proposition^ Write M = M 0 + H , where if = (. H L ,H S ), and H * = = V^H 

and P' £,± = (Hf ± ,F[g ± ) = V x y±H. We want to bound 


TT 11 _ || TJ-'I' I ZJ-\I>- L || 

— 11-“ +-« I|f,a 

< II#* ||f,a + 11^11^ 

< ||^||f,a + ||(Ptx x'Pr)H* ± \\ Fi x + \\(V T xV T ±)H* ± \\F,x. (19) 
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We start with the first term of (19). Since = Hg = (1/2 )Vn oha (HL + Hg) 


H' ! ’\\ F ,x= VY ^\\rn obs (HL + Hs)\\ F 

= ^^\\Vn oha (L + S-L 0 - S')IIf 

< {\\Vn ohB (L + 5 - X)\\ F + \\Vn ohs (Lo + S' 0 - X)|| F ) 

< 5y/l + A 2 , 


where the last inequality is due to the fact that both Mq and M are feasible. 


Then we focus on the second term of (19). First, we have 


HMollo > ||M||o = ||M 0 + tf||o > \\Mq + ||o — \\H ^|| 0 


By Lemma |2j 


WMo + H^Ws > \\M 0 \\ 0 + a(n)\\V T ^Hr\U + b(n,\)\\VrHl 


where 


a(n) = \~~ and b(n, A) = ^ - ^4^. 

An A n z 


Now, combining the above inequalities, 


H*\\ <> >a(n)\\V T ±Hr\U + b(n,\)\\'PrHZ 




By the assumption that A > 4 jn and n > 3, 


1 1 


a{n ) = ->0 and b{n , A) = — — 


A n + 1 2 1 1 1 1 


>-FT =-O > 0. 


2 n v ’ ' 2 n 2 n n n 2 n n 2 


Therefore (20) implies H-H^Ho > a ( n )\\'PT ± ^L ± \\* an d ll-^llo — b(n, X)\\VrHf 


( 20 ) 
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Now, we are ready to establish a bound for the second term of (19). 


\\(V T ± xPr)#* X |kA < \\V T ±H% ± \\ F + \\\‘P r Hl!j ± \\ F 

< \\V T ,Ht\U + X\WrHth 

< 4(||^ll* + A||^f |U). 


As for the third term of (19), we apply Lemma [3] and the bound of the second term in (19). As 
Vq,H' v L = 0, Vy{V T x V v ±)H^ +Vy{V T ± x V r )H' 5,± = 0. Therefore, due to Lemma @ 


\\Vq,(VT x V r ±)H' i ’ || = \\Vx 5>(V t ± x Vr)H' iJ H^a < — j=\\(V T ± x Vr)H' iJ ||f,a- 
As Vq± Hs does not affect the feasibility of M + H and H is chosen such that ||M + H ||q is 

obs 

minimized, thus V n ± Ho = V n ± Hs = 0 which implies (' Vt x V r ±)H'* >± = (Vt x Vn)H' i ’ ± . 

obs obs 

Thus, by Lemma [3j 


\\{Vt X V^)H^\\ f < J {1+ 8 x2)po \\(V T ± X Vr)H^\\ FtX < \\(V T ± x Vr)H^\\ FtX . 

And || {Vt x V t ±.)H^\\ f ^ < \\(V T x V T ±)H^ || F as A < 1. 

Collecting all the above bounds for the three terms, we derive the bound for ||iL|| Fj A : 

||#||f,a <5\/\ + \ 2 + a(i + (\\Hf ||* + A||iLf ||i). 

V V PoJ 


Finally, || Hf II* < y/n\\Hf\\ F , 11 Hg 11 1 = \Jpon 2 \\Hg ||i? (since H$ is supported on fl 0 b s ) and 
\\Hf\\ F = || = \\Vn ohs (H L + H s )\\f/2 < 5. Therefore, 


I^IIf.a < 5 


\/l + A 2 + 4^1 + ^-0 (Vn + nA v /p 0 ) 


Assume that A = 1/y/npo. First we note that, due to A > 4/n, this condition imposes a reasonable 
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coverage of p q: l/n < po < n/ 16. Now we focus on simplifying the bound for 



To prove Theorem [2j we establish one additional lemma. 

Lemma 5. Suppose \\V F — Pq 1 'PT'Pn obs T > T\\ < 1/2- Then for any matrix Q, 

\\Vn obs V T Qf F >^\\VTQ\\ 2 F . 

Proof of Lemma\5\ By the assumptions, for any matrix Q, 


\\rn ohs V T Q\\ 2 F = (Vn ohs V T Q,Vn ohs V T Q) 

= {PtQ ,'Pt'Pci^'PtQ) 

= Po{'PTQ,Po 1 'PT'Pn ohs VTQ) 

= Po [H^tIIf + {PtQ , {Po 1 T l T'Pn obs 'PT - Vt)Q\ 
<P0 {\\VtQ\\ 2 f - 1 -\\VtQ\\ 2 f ^ 

= f rrQUl- 


□ 


Proof of Theorem ^ Recall that we write that an event occurs with high probability if it holds 
with probability at least 1 — 0(n~ 10 ). Due to the asymptotic nature of Theorem [ 2 J we only 
require the conditions of Proposition [3] to hold asymptotically with large probability. By Lemma 
A.3 of Candes et al. (2011), WVtVt^QWf < n-\\'P T ±'P r ±Q\\ F for all Q, with high probability. By 
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Lemma[5]and Theorem 2.6 of Candes et al. (2011) (see also Candes and Recht 2009 Theorem 4.1), 
WPq^VtQWf > ^W'PtQWf for all Q, with high probability. Further, by 


Candes and Recht 


(2009), 


WVtVuW < Po/8 occurs with high probability. Candes et al. (2011, pp. 33-35) show that there exist 


dual certificates ( W,F ) obeying (18) with high probability. For sufficiently large n, the conditions 
of A and po in Proposition [3] are fulfilled. Therefore, Theorem [2] follows from Proposition [3} □ 
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